[DistInf] Enable SGLang disagg with MoRI-io & MoRI-EP for all existing models by lcskrishna · Pull Request #154 · ROCm/MAD

lcskrishna · 2026-04-24T16:18:51Z

Motivation

This PR aims to refactor MoRI IO & MoRI EP solution to SGLang disagg models.

Summary:

Introduced DP_MODE=1 / 0 to switch between TP solution and DP solution. (Default 0)
Introduced RUN_MORI=1 to switch to mori solutions. This will be added as default in future. (Default=0)
models.yaml - Contains model information and related arguments that are required for enabling with MoRI IO & MoRI EP.
sglang_disagg_mori_io_ep.sh - contains scripts to run MoRI-IO & EP solutions for the models present in yaml file.
Some basic modifications in sglang_disagg_mori_io_ep.sh

Test Plan

export DOCKER_IMAGE_NAME=DOCKER_IMAGE;
export xP=1;
export yD=2;
export MODEL_NAME=DeepSeek-R1;
export MODEL_DIR=/shared_inference/models/deepseek-ai/;
export RUN_MORI=1;

echo "Submitting with xP=${xP} yD=${yD}"
NODE_LIST=<NODELIST>
sbatch -N 4 -n 4 -p amd-rccl --nodelist=${NODE_LIST} run_xPyD_models.slurm;

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

Refactors the SGLang disaggregated (prefill/decode) launch flow to support MoRI-IO + MoRI-EP across existing models, adding a DP-vs-TP switch and a YAML-based model flag catalog.

Changes:

Added sglang_disagg_mori_io_ep.sh (MoRI transfer backend launcher) with DP_MODE support and YAML-driven per-model flags.
Updated run_xPyD_models.slurm to optionally route runs through the MoRI launcher (RUN_MORI=1), add DP-mode validation, and pass additional env knobs.
Added MoRI environment defaults (mori_ep_env.sh) and a model catalog (models.yaml); updated the disagg inference Dockerfile for a newer base and NIC backend selection.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.

Show a summary per file

File	Description
scripts/sglang_disagg/sglang_disagg_mori_io_ep.sh	New MoRI-based disaggregated launcher with DP/TP mode selection, log-based readiness gating, and YAML model configuration.
scripts/sglang_disagg/run_xPyD_models.slurm	Adds MoRI entrypoint selection, DP-mode allowlist validation, and passes new MoRI-related env vars into containers.
scripts/sglang_disagg/mori_ep_env.sh	Introduces MoRI/SGLang multi-node environment defaults and DP-mode tuning knobs.
scripts/sglang_disagg/models.yaml	Centralizes per-model CLI flags for TP vs DP launch modes.
docker/sglang_disagg_inference.ubuntu.amd.Dockerfile	Updates base image and adds NIC backend build-time selection logic for MoRI.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-24T16:26:46Z

+_run_mori="${RUN_MORI:-0}"
+if [[ "$_run_mori" == "1" ]]; then
+    if model_allows_mori_ep "$MODEL_NAME"; then
+        RUN_FILE="sglang_disagg_mori_io_ep.sh"
+        echo "RUN_MORI=1: using $RUN_FILE for model '$MODEL_NAME'"
+    else
+        echo "Error: RUN_MORI=1 but MODEL_NAME '$MODEL_NAME' is not in MORI_EP_VALID_MODELS" >&2
+        printf "MoRI EP allowed models:\n" >&2
+        for m in "${MORI_EP_VALID_MODELS[@]}"; do
+            printf "  - %s\n" "$m" >&2
+        done
+        exit 1
+    fi
+    if [[ ! -f "$RUN_FILE" ]]; then
+        echo "Error: RUN_MORI=1 requires '$RUN_FILE' in $(pwd)." >&2
+        exit 1
+    fi
 else
-    echo "Error: Model '$MODEL_NAME' not found in MODEL_RUNFILES"
-    echo "Available models: ${!MODEL_RUNFILES[@]}"
-    exit 1
+    RUN_FILE="sglang_disagg_server.sh"
+    echo "RUN_MORI not set: using $RUN_FILE"    
 fi


PR description says RUN_MORI “will be added as default”, but here _run_mori defaults to 0 and the script explicitly logs “RUN_MORI not set”. If the intended behavior is to enable MoRI by default, consider flipping the default to RUN_MORI=1 (or clarify the PR description / add a TODO noting it’s intentionally not default yet).

Copilot · 2026-04-24T16:26:47Z

+    if [[ ! -f "$RUN_FILE" ]]; then
+        echo "Error: RUN_MORI=1 requires '$RUN_FILE' in $(pwd)." >&2
+        exit 1
+    fi


RUN_FILE is checked as a relative path (-f "$RUN_FILE") and later used to build RUN_FILE_FULL based on MOONCAKE_REPO_DIR=$(pwd). In Slurm jobs the working directory depends on where sbatch was invoked, so submitting from outside this folder can break unexpectedly. Consider resolving paths relative to the script location (e.g., SCRIPT_DIR=$(cd "$(dirname "$0")" && pwd)) and basing MOONCAKE_REPO_DIR / RUN_FILE_FULL on that.

Copilot · 2026-04-24T16:26:47Z

 # Define valid model names
 VALID_MODELS=( \
    "Qwen3-32B" \
-    "Mixtral-8x7B-v0.1" \
+    "Mixtral-8x7B-Instruct-v0.1" \
    "Llama-3.1-8B-Instruct" \
    "Llama-3.1-405B-Instruct-FP8-KV" \
    "amd-Llama-3.3-70B-Instruct-FP8-KV" \
    "DeepSeek-V3" \
+    "DeepSeek-R1" \
 )


Model naming is now Mixtral-8x7B-Instruct-v0.1 in VALID_MODELS, but other launch helpers in this directory still reference Mixtral-8x7B-v0.1 (e.g., scripts/sglang_disagg/salloc_launch.sh:8 and scripts/sglang_disagg/sglang_disagg_server.sh model config maps). This inconsistency will cause validation failures or unexpected defaults depending on which entrypoint is used; consider updating the other scripts or supporting both aliases here.

Copilot · 2026-04-24T16:26:47Z

+    set -x
+    eval "$ROUTER_CMD" 2>&1 | tee "/run_logs/${SLURM_JOB_ID:-0}/proxy_NODE${NODE_RANK}.log" >/dev/null &
+    set +x
+    proxy_pid=$!
+    echo "Router (sglang_router) started pid=${proxy_pid} (DP_MODE=${DP_MODE})"


proxy_pid=$! here captures the PID of tee (the last process in the backgrounded pipeline), not the python3 -m sglang_router.launch_router process. As a result, kill "${proxy_pid}" may not actually stop the router, leaving it running and keeping port 2322 bound. Prefer launching without a pipe (redirect directly to a log file), or use a process-group/job-control approach so you can reliably terminate the router process.

Suggested change

set -x

eval "$ROUTER_CMD" 2>&1 | tee "/run_logs/${SLURM_JOB_ID:-0}/proxy_NODE${NODE_RANK}.log" >/dev/null &

set +x

proxy_pid=$!

echo "Router (sglang_router) started pid=${proxy_pid} (DP_MODE=${DP_MODE})"

proxy_log="/run_logs/${SLURM_JOB_ID:-0}/proxy_NODE${NODE_RANK}.log"

set -x

bash -lc 'exec ${ROUTER_CMD}' >"${proxy_log}" 2>&1 &

set +x

proxy_pid=$!

echo "Router (sglang_router) started pid=${proxy_pid} (DP_MODE=${DP_MODE}, log=${proxy_log})"

Copilot · 2026-04-24T16:26:47Z

+        echo ""
+    } | tee "$PREFILL_LOG"
+    set -x
+    eval "$PREFILL_CMD" 2>&1 | tee -a "$PREFILL_LOG" >/dev/null &


Same pipeline PID issue as the router: prefill_pid=$! is the PID of tee, not the actual sglang.launch_server process, so the subsequent kill "${prefill_pid}" may not stop the server. Consider redirecting logs without tee, or track/kill the real server PID (e.g., avoid pipelines or kill the process group).

Suggested change

eval "$PREFILL_CMD" 2>&1 | tee -a "$PREFILL_LOG" >/dev/null &

eval "$PREFILL_CMD" >>"$PREFILL_LOG" 2>&1 &

Copilot · 2026-04-24T16:26:48Z

+git clone --branch v62.0 --depth 1 https://github.com/linux-rdma/rdma-core.git /tmp/rdma-core && \
+    cd /tmp/rdma-core && \
+    mkdir -p build && cd build && \
+    cmake -GNinja -DCMAKE_INSTALL_PREFIX=/usr -DNO_MAN_PAGES=1 .. && \
+    ninja &&     ninja install &&     ldconfig &&     rm -rf /tmp/rdma-core


The rdma-core source clone/build/install runs unconditionally on every launch. This adds significant startup time and introduces a hard dependency on outbound network access and build tooling inside the container; if GitHub is unavailable the entire job will fail. Consider moving this into the Docker image (preferred), or gating it behind an opt-in env var and/or a “already installed” check before rebuilding.

Suggested change

git clone --branch v62.0 --depth 1 https://github.com/linux-rdma/rdma-core.git /tmp/rdma-core && \

cd /tmp/rdma-core && \

mkdir -p build && cd build && \

cmake -GNinja -DCMAKE_INSTALL_PREFIX=/usr -DNO_MAN_PAGES=1 .. && \

ninja && ninja install && ldconfig && rm -rf /tmp/rdma-core

## Prefer baking this into the image; runtime install is opt-in to avoid

## unconditional startup cost and hard dependency on outbound network access.

INSTALL_RDMA_CORE_AT_RUNTIME="${INSTALL_RDMA_CORE_AT_RUNTIME:-0}"

if command -v ibv_devinfo >/dev/null 2>&1; then

echo "RDMA userspace already present; skipping runtime rdma-core install."

elif [[ "${INSTALL_RDMA_CORE_AT_RUNTIME}" == "1" ]]; then

git clone --branch v62.0 --depth 1 https://github.com/linux-rdma/rdma-core.git /tmp/rdma-core && \

cd /tmp/rdma-core && \

mkdir -p build && cd build && \

cmake -GNinja -DCMAKE_INSTALL_PREFIX=/usr -DNO_MAN_PAGES=1 .. && \

ninja && ninja install && ldconfig && rm -rf /tmp/rdma-core

if ! command -v ibv_devinfo >/dev/null 2>&1; then

echo "ERROR: INSTALL_RDMA_CORE_AT_RUNTIME=1 was set, but rdma-core install did not complete successfully." >&2

exit 1

fi

else

echo "Skipping runtime rdma-core install (set INSTALL_RDMA_CORE_AT_RUNTIME=1 to enable this temporary workaround)."

fi

Copilot · 2026-04-24T16:26:48Z

+NODE_RANK="${NODE_RANK:-0}"
+MODEL_PATH=$MODEL_PATH
+MODEL_NAME="${MODEL_NAME:-}"
+xP="${xP:-1}"
+yD="${yD:-1}"
+DP_MODE="${DP_MODE:-0}"
+# PARALLEL_MODE is derived strictly from DP_MODE for models.yaml (tp vs dp flags).
+if [[ "$DP_MODE" == "1" ]]; then
+    PARALLEL_MODE=dp
+else
+    PARALLEL_MODE=tp
+fi
+echo "PARALLEL_MODE=${PARALLEL_MODE} (DP_MODE=${DP_MODE})"
+
+if [[ -z "${MODEL_NAME:-}" ]]; then
+    echo "ERROR: MODEL_NAME not set, exiting"
+    exit 1
+fi


MODEL_PATH=$MODEL_PATH will silently set MODEL_PATH to an empty string if it wasn't provided, and the script never validates that it is non-empty before using --model-path ${MODEL_PATH}. Add an explicit check (similar to MODEL_NAME) and use a safe default/assignment (e.g., MODEL_PATH="${MODEL_PATH:-}") so misconfiguration fails fast with a clear error.

Copilot · 2026-04-24T16:26:48Z

 RUN pip install --upgrade sglang-router

+# the default already installs mori with AINIC. Reinstall from source for cx7 and bnxt; ainic only clears USE_* overrides.
+ARG NIC_BACKEND="cx7"


In a Dockerfile, ARG default values are taken literally (quotes are not stripped). With ARG NIC_BACKEND="cx7", the value becomes "cx7", so the later case "${NIC_BACKEND}" in cx7) will not match and will fall into the error branch. Use ARG NIC_BACKEND=cx7 (no quotes), and document quoting in the docker build --build-arg example if needed.

Suggested change

ARG NIC_BACKEND="cx7"

ARG NIC_BACKEND=cx7

Copilot · 2026-04-24T16:26:48Z

+    set -x
+    eval "$DECODE_CMD" 2>&1 | tee -a "$DECODE_LOG" >/dev/null &
+    set +x
+    decode_pid=$!
+


Same pipeline PID issue as the router/prefill: decode_pid=$! captures tee's PID, not the sglang.launch_server PID, so kill "${decode_pid}" may fail to stop the decode server cleanly. Recommend launching the server without a pipeline or killing the process group so the real server process is terminated.

Copilot · 2026-04-24T16:26:49Z

+export MODELS_YAML MODEL_NAME PARALLEL_MODE
+eval "$(python3 - <<'PY'
+import os
+import shlex
+import sys
+import yaml
+
+config_path = os.environ["MODELS_YAML"]
+model_name = os.environ["MODEL_NAME"]
+mode = os.environ["PARALLEL_MODE"]
+
+with open(config_path, "r", encoding="utf-8") as f:
+    models = yaml.safe_load(f) or {}
+
+if model_name not in models:
+    print(f'echo "ERROR: Model {model_name} not found in {config_path}"; exit 1')
+    sys.exit(0)
+
+cfg = models[model_name] or {}
+prefill = cfg.get("prefill", {}) or {}
+decode = cfg.get("decode", {}) or {}
+
+
+def q(v):
+    return shlex.quote(str(v if v is not None else ""))
+
+
+exports = {
+    "MODEL_BASE_FLAGS": cfg.get("base_flags", ""),
+    "MODEL_MODE_FLAGS": cfg.get(f"{mode}_flags", ""),
+    "MODEL_PREFILL_FLAGS": prefill.get(mode, ""),
+    "MODEL_DECODE_FLAGS": decode.get(mode, ""),
+    "MODEL_EXPERIMENTAL_FLAGS": cfg.get("experimental_flags", ""),
+}
+
+for key, value in exports.items():
+    print(f"{key}={q(value)}")
+PY
+)"
+
+PREFILL_MODEL_CONFIG="${MODEL_BASE_FLAGS} ${MODEL_MODE_FLAGS} ${MODEL_PREFILL_FLAGS} ${MODEL_EXPERIMENTAL_FLAGS}"
+DECODE_MODEL_CONFIG="${MODEL_BASE_FLAGS} ${MODEL_MODE_FLAGS} ${MODEL_DECODE_FLAGS} ${MODEL_EXPERIMENTAL_FLAGS}"


This uses eval to ingest YAML-provided flag strings and later eval again to execute the assembled command. Because the YAML values can include shell expansions (and the current YAML intentionally contains $(seq ...)), this effectively allows arbitrary shell execution via models.yaml. If you only need list expansion for flags like --cuda-graph-bs, prefer representing them as explicit space-separated values in YAML (no $(...)) and build the python3 argv as a Bash array to avoid eval entirely.

Enable SGLang disagg with MoRI-io & MoRI-EP

45d8199

Copilot AI review requested due to automatic review settings April 24, 2026 16:18

lcskrishna requested review from Rohan138, amathews-amd, coketaste, gargrahul and ppalaniappan-amd as code owners April 24, 2026 16:18

Copilot started reviewing on behalf of lcskrishna April 24, 2026 16:20 View session

Copilot AI reviewed Apr 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DistInf] Enable SGLang disagg with MoRI-io & MoRI-EP for all existing models#154

[DistInf] Enable SGLang disagg with MoRI-io & MoRI-EP for all existing models#154
lcskrishna wants to merge 1 commit intoROCm:developfrom
lcskrishna:csrikris-pr222

lcskrishna commented Apr 24, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	eval "$PREFILL_CMD" 2>&1 \| tee -a "$PREFILL_LOG" >/dev/null &
	eval "$PREFILL_CMD" >>"$PREFILL_LOG" 2>&1 &

Conversation

lcskrishna commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Test Plan

Submission Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lcskrishna commented Apr 24, 2026 •

edited

Loading